PhraseRate: An HTML Keyphrase Extractor
Abstract
A standard feature in cataloging documents is the list of keywords. When the source documents are web pages, we can attempt to aid the cataloger by analyzing the page and presenting relevant support material. Since the keywords in a document generally occur within keyphrases, and keyphrases provide context for reviewing candidate keywords, they are a natural aggregate to extract and present to the cataloger. This paper describes PhraseRate, an interactive aid for keyword extraction designed to assist human classifiers in the Infomine Project (http://infomine.ucr.edu/). In particular, it introduces a novel keyphrase extraction heuristic for web pages that requires no training and instead relies on the assumption that most well-written web pages “suggest” keyphrases through their internal structure. It is very fast and flexible, and its results compare favorably with the state of the art in keyphrase extraction.
Similar resources
Adaptation of a Keyphrase Extractor for Japanese Text*
This paper presents some statistical observations relevant to Japanese keyphrase extraction, as well as the details of the implementation of a keyphrase extraction algorithm (called Extractor) for Japanese documents. Parts of the algorithm include an efficient method of extracting the keyphrase candidates, a way to pinpoint the most probable keyphrases using contextual information, a technique ...
Keyphrase Extraction: Enhancing Lists
This paper proposes some modest improvements to Extractor, a state-of-the-art keyphrase extraction system, by using a terabyte-sized corpus to estimate the informativeness and semantic similarity of keyphrases. We present two techniques to improve the organization and remove outliers of lists of keyphrases. The first is a simple ordering according to their occurrences in the corpus; the second ...
Noun Compound and Named Entity Recognition and their Usability in Keyphrase Extraction
We investigate how the automatic identification of noun compounds and named entities can contribute to keyphrase extraction and we also show how previously identified noun compounds affect named entity recognition and vice versa, how noun compound detection is supported by identified named entities. Our experiments demonstrate that already known noun compounds yield better performance in named ...
DegExt - A Language-Independent Graph-Based Keyphrase Extractor
In this paper, we introduce DegExt, a graph-based, language-independent keyphrase extractor, which extends the keyword extraction method described in [6]. We compare DegExt with two state-of-the-art approaches to keyphrase extraction: GenEx [11] and TextRank [8]. Our experiments on a collection of benchmark summaries show that DegExt outperforms TextRank and GenEx in terms of precision and area un...
Supervised Keyphrase Extraction as Positive Unlabeled Learning
The problem of noisy and unbalanced training data for supervised keyphrase extraction results from the subjectivity of keyphrase assignment, which we quantify by crowdsourcing keyphrases for news and fashion magazine articles with many annotators per document. We show that annotators exhibit substantial disagreement, meaning that single annotator data could lead to very different training sets ...